helpers
Helpers extract information from crawled pages and format it as Algolia records.
Use helpers in your recordExtractor
to make it easier to extract relevant content from your page.
Algolia has a selection of helpers:
product
article
page
splitContentIntoRecords
codeSnippets
docsearch
.
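As a minimal sketch of how a helper plugs into a recordExtractor, the function below hands the crawled page's url and Cheerio instance ($) to the page helper and returns the result as a single record. The call shape helpers.page({ url, $ }) and the array return value are assumptions, not confirmed by this page; check them against your crawler's configuration reference before relying on them.

```javascript
// Sketch of a record extractor that delegates to the `page` helper.
// Assumption: the crawler calls this function with `url`, `$` (Cheerio), and `helpers`.
function recordExtractor({ url, $, helpers }) {
  // Assumed call shape: the helper receives the same `url` and `$`.
  const record = helpers.page({ url, $ });

  // Return no records if the helper found nothing to extract.
  return record ? [record] : [];
}
```

Returning an empty array when the helper yields nothing keeps the extractor's return type consistent, which makes it easier to combine several helpers later.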
product
This helper extracts content from product pages. A “product page” is an HTML page with one of these JSON-LD schema types:
Product
DietarySupplement
Drug
IndividualProduct
ProductCollection
ProductGroup
ProductModel
SomeProducts
Vehicle
.
Response
The helper returns an object with the following properties:
The product page’s URL.
The product page’s URL (without parameters or hashes).
The language the page content is written in (from the HTML lang
attribute).
The name
field of the JSON-LD schema.
The sku
field of the JSON-LD schema.
The description
field of the JSON-LD schema.
The image
field of the JSON-LD schema.
The product’s price, selected from one of these JSON-LD schema fields, in this order:
offers.price
offers.highPrice
offers.lowPrice
.
The offers.priceCurrency
field of the JSON-LD schema.
The category
field of the JSON-LD schema.
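The price fallback order described above can be sketched as a small function. This is a hypothetical illustration, not the helper's actual implementation: the name pickPrice and the treatment of empty strings as missing are assumptions.

```javascript
// Hypothetical illustration of the documented price fallback order.
// `offers` is the `offers` object from a page's JSON-LD product schema.
function pickPrice(offers) {
  if (!offers) return undefined;

  // Documented order: price, then highPrice, then lowPrice.
  const candidates = [offers.price, offers.highPrice, offers.lowPrice];

  // Take the first field that is actually present (assumption: '' counts as missing).
  return candidates.find(
    (value) => value !== undefined && value !== null && value !== ''
  );
}
```

For example, an offers object with only highPrice and lowPrice would yield highPrice, because price is missing and highPrice comes next in the order.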
article
This helper extracts content from article pages. An “article page” is an HTML page with an appropriate JSON-LD schema or meta tag:
An og:type
HTML meta tag with the value article
.
Response
The helper returns an object with the following properties:
The article’s URL.
The article’s URL (without parameters or hashes).
The language the article is written in (from the HTML lang
attribute).
The article’s headline, selected from one of these, in this order:
meta[property="og:title"]
meta[name="twitter:title"]
head > title
the first <h1>
.
The article’s description, selected from one of these, in this order:
meta[name="description"]
meta[property="og:description"]
meta[name="twitter:description"]
.
The keywords
field of the JSON-LD schema.
Article tags: meta[property="article:tag"]
.
The image associated with the article, selected from one of these, in this order:
meta[property="og:image"]
meta[name="twitter:image"]
.
The author
field of the JSON-LD schema.
The datePublished
field of the JSON-LD schema.
The dateModified
field of the JSON-LD schema.
The category
field of the JSON-LD schema.
The article’s content (body copy).
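The headline lookup order described above amounts to trying a list of selectors until one yields text. The sketch below is a hypothetical illustration of that idea, not the helper's code: getText is an assumed callback that returns the text (or content attribute) of the first element matching a selector, so the example stays independent of any HTML parser.

```javascript
// Hypothetical illustration of the documented headline fallback order.
// `getText(selector)` is an assumed callback: it returns the text (or the
// `content` attribute) of the first matching element, or '' if none matches.
const HEADLINE_SELECTORS = [
  'meta[property="og:title"]',
  'meta[name="twitter:title"]',
  'head > title',
  'h1',
];

function pickHeadline(getText) {
  for (const selector of HEADLINE_SELECTORS) {
    const text = (getText(selector) || '').trim();
    if (text) return text; // the first non-empty match wins
  }
  return undefined;
}
```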
page
This helper extracts text from pages regardless of their type or category.
Response
The helper returns an object with the following properties:
The object’s unique identifier.
The page’s URL.
The URL hostname (for example, example.com
).
The URL path: everything after the hostname.
The URL depth, based on the number of slashes after the domain.
For example, http://example.com/
= 1, http://example.com/about
= 1, http://example.com/about/
= 2.
The page’s file type.
One of: html
, xml
, json
, pdf
, doc
, xls
, ppt
, odt
, ods
, odp
, or email
.
The page length in bytes.
The page title, derived from head > title
.
The page’s description, derived from meta[name="description"]
.
The page’s keywords, derived from meta[name="keywords"]
.
The image associated with the page, derived from meta[property="og:image"]
.
The page’s section titles, derived from h1
and h2
.
The page’s content (body copy).
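The URL depth rule described above (count the slashes after the domain) can be sketched as follows. This is a hypothetical re-implementation for illustration only. It assumes that only the path counts, so slashes inside a query string or hash are ignored.

```javascript
// Hypothetical re-implementation of the documented depth rule:
// count the slashes in the path that follows the domain.
// Assumption: query strings and hashes are not part of the count.
function urlDepth(href) {
  const { pathname } = new URL(href);
  // Splitting on '/' yields one more piece than there are slashes.
  return pathname.split('/').length - 1;
}
```

With this rule, a trailing slash adds one level, which matches the documented examples: /about has one slash and /about/ has two.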
splitContentIntoRecords
This helper extracts text from long HTML pages and splits them into smaller chunks. This can help prevent “Record too big” errors.
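A record extractor that uses this helper might look like the following sketch. The parameter names (baseRecord, $elements, maxRecordBytes, textAttributeName, orderingAttributeName) are assumptions matched to the parameter descriptions in the Response section. The function name splitRecordExtractor, the title selector, and the attribute names text and part are illustrative choices.

```javascript
// Sketch of a record extractor that splits a long page into small records.
// Assumption: the crawler passes `url`, `$` (Cheerio), and `helpers`.
function splitRecordExtractor({ url, $, helpers }) {
  // Attributes copied onto every split record.
  const baseRecord = {
    url,
    title: $('head > title').text().trim(),
  };

  // Assumed parameter names; see the lead-in above.
  return helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),          // elements to extract content from
    maxRecordBytes: 1000,          // upper bound on each record's size
    textAttributeName: 'text',     // attribute that stores each chunk's text
    orderingAttributeName: 'part', // attribute that stores each chunk's position
  });
}
```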
Using this example record extractor on a long page returns an array of records, each one smaller than 1,000 bytes.
When a page is split, the same words can appear in several records that belong to that page. If you don’t want these duplicates to show up when users search:
- Set distinct to true in your index. For example, distinct: true.
- Set attributeForDistinct to your page’s URL. For example, attributeForDistinct: 'url'.
- Set searchableAttributes to your page title and body content. For example, searchableAttributes: [ 'title', 'text' ].
- Add a customRanking to sort from the first split record on your page to the last. For example, customRanking: [ 'asc(part)' ].
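Taken together, the settings listed above could be gathered into one object, as in this sketch. The variable name indexSettings is hypothetical, and the attribute names assume records that carry url, title, text, and part attributes. Where this object goes in your crawler or index configuration is not covered here.

```javascript
// Index settings from the list above, gathered into one object.
// Assumption: split records carry `url`, `title`, `text`, and `part` attributes.
const indexSettings = {
  distinct: true,                          // return one record per page
  attributeForDistinct: 'url',             // records with the same URL are grouped
  searchableAttributes: ['title', 'text'], // search the title and body text
  customRanking: ['asc(part)'],            // earlier parts of a page rank first
};
```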
Response
Specify one or more response parameters in your helper to determine what information is returned.
Takes this record’s attributes (and values) and adds them to all the split records.
A Cheerio selector that determines from which elements content will be extracted. For more information, see Extracting data with Cheerio.
Maximum number of bytes allowed per record. To avoid errors, check your plan’s record size limits.
This attribute stores the sequentially generated number assigned to each record when the helper splits a page.
Name of the attribute in which to store the text of each split record.
codeSnippets
Use this helper to extract code snippets from pages.
The helper finds code snippets by looking for <pre>
tags and extracting the content
and the language class prefix from the tag.
If the crawler finds several code snippets on a page, the helper returns a list of those snippets.
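To illustrate what a language class prefix means in practice, the sketch below derives a language name from a tag's class attribute. It is a hypothetical illustration, not the helper's implementation: the function name languageFromClass and the default prefix language- are assumptions.

```javascript
// Hypothetical illustration: derive a snippet's language from its class names.
// `classAttr` is the tag's `class` attribute; `prefix` is the language class prefix.
// Assumption: 'language-' is used as the default prefix for this example only.
function languageFromClass(classAttr, prefix = 'language-') {
  const classes = (classAttr || '').split(/\s+/).filter(Boolean);
  const match = classes.find(
    (name) => name.startsWith(prefix) && name.length > prefix.length
  );
  // Strip the prefix; return undefined when no language class is present.
  return match ? match.slice(prefix.length) : undefined;
}
```

For a tag with the classes hljs and language-js, this returns js. A tag without a matching class yields undefined, which corresponds to the "if found" wording in the response description.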
Response
The helper returns an array of code objects with the following properties:
The code snippet.
The code snippet’s language (if found).
The URL of the nearest sibling <a>
tag.
A text fragment URL for the code snippet. A text fragment URL links directly to a selection of text within a page.
docsearch
This helper extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy.
You can also use it without DocSearch or to index non-documentation content. For more information, see the DocSearch documentation.